Text File | 1996-11-16 | 12KB | 319 lines
AbbotDemo mini-FAQ
==================
Q1: What is AbbotDemo?
----------------------
AbbotDemo is a packaged demonstration of the Abbot connectionist/HMM
continuous speech recognition system developed by the Connectionist
Speech Group at Cambridge University. The system is designed to
recognize British English and American English clearly spoken in a quiet
acoustic environment.
This demonstration system has a vocabulary of 5000 words - anything
spoken outside this vocabulary cannot be recognised (and therefore will
be recognised as another word or string of words). The vocabulary and
grammar were optimised for the task of reading from a North American
Business newspaper, for example the Wall Street Journal (the word list
is given in file vocab5k.txt).
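To get a rough idea of which words in a sentence fall outside the
vocabulary, something like the following shell function could be used.
It is only a sketch: it assumes vocab5k.txt holds one word per line (the
file format is not described here), and the name oov_words is made up
for this example.

```shell
#!/bin/sh
# oov_words VOCAB_FILE WORDS...
# Print each word of the given text that matches no line of VOCAB_FILE.
# ASSUMPTION: VOCAB_FILE contains one word per line; matching ignores case.
oov_words() {
    vocab=$1
    shift
    # Upper-case the text, split it into one word per line (letters and
    # apostrophes are kept, everything else separates words), drop empty
    # lines, then keep only words that equal no whole line of the list.
    printf '%s\n' "$*" |
        tr '[:lower:]' '[:upper:]' |
        tr -cs "A-Z'" '\n' |
        grep -v '^$' |
        grep -vixF -f "$vocab"
}
```

For example, oov_words vocab5k.txt "President Clinton denied it" prints
nothing if every word is in the list.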
Q2: Why was AbbotDemo released?
-------------------------------
a) For information: We want to show what speech recognition systems
are capable of at the moment.
b) For publicity: Connectionist HMM systems have some advantages over
traditional HMM systems. We are open to people who wish to license
this technology and are looking for research funding to continue this
work.
Q3: How do I install AbbotDemo?
-------------------------------
This is a binary only release (compilation free:). Binaries are
available from the svr-ftp.eng.cam.ac.uk FTP site in directory
comp.speech/data. The file AbbotDemo-0.x.tar.gz contains binaries for
all supported architectures. The files AbbotDemo-0.x-${OS}.tar.gz
contain complete releases for specific operating systems only. The
available architectures are SunOS, IRIX, HP-UX and Linux.
To install you need to get the appropriate binary release and extract
the files using gzip and tar. Typically this will look something like:
unix$ gunzip -c AbbotDemo-0.5.tar.gz | tar xvf -
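A slightly more defensive version of the same step, as a sketch: test
the archive before unpacking it and extract into a chosen directory.
The function name extract_release and the destination directory are
illustrative and not part of the release.

```shell
#!/bin/sh
# extract_release ARCHIVE [DESTDIR]
# Unpack a gzip-compressed tar archive into DESTDIR (default: current
# directory). Hypothetical helper, not shipped with AbbotDemo.
extract_release() {
    archive=$1
    dest=${2:-.}
    [ -f "$archive" ] || { echo "no such archive: $archive" >&2; return 1; }
    # Verify the compressed data first, so a truncated download is
    # caught before any files are written.
    gzip -t "$archive" || return 1
    mkdir -p "$dest" || return 1
    # Decompress in the current directory, extract inside DESTDIR.
    gunzip -c "$archive" | (cd "$dest" && tar xf -)
}
```

Typical use would be: extract_release AbbotDemo-0.5.tar.gz .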
Q4: How do I run AbbotDemo?
---------------------------
The recognition system is called from the "AbbotDemo" shell script.
This script must be given an argument of either "-uk" or "-us" to run
with British or American English models respectively. For example:
unix$ ./AbbotDemo -us
A window should appear, called AbbotAudio, for controlling the recording
of the speech. A sample session is described below.
Initialization: Before processing any speech, first click on
"Calibrate". This calibrates the automatic speech start- and end-point
detection algorithm based on the background noise level. This
calibration process should be repeated whenever the speech capture
environment changes.
Speech Collection: Click on "Acquire" and say something; for example,
"President Clinton denied it". The system has a rudimentary automatic
start and end point detector and the waveform will be displayed once
recording has finished. If a waveform does not appear, check that the
input levels are set to reasonable values. The "-audiogain" flag to
AbbotDemo pops up an additional window for setting the recording gain.
Be sure to repeat the calibration step if the recording levels are
changed. If the end-point detector is not functioning properly,
clicking again on "Acquire" will cause the system to stop recording
speech.
Speech Validation: Click on "Play" to confirm the recording quality.
This will play the sampled waveform. If you want to see a
time-frequency plot of the recorded speech, click on the "Spectrogram"
button.
Recognition: Now click on "Pipe to NOWAY" to start the recognition
process. The screen should show something like this (with each line
overwriting the last):
1
1 THE
1 THE BEST OF TWO
1 THE REST OF THE UNIT AND IN
1 PRESIDENT CLINTON DENIED IT
1 PRESIDENT CLINTON DENIED IT A
1 PRESIDENT CLINTON DENIED IT
The script prints out the best guess to the word string as the recognition
proceeds and the final recognised word string at the end. Recognition
should take about 8 Mbyte of memory and run in a few times real time on a
486DX or faster processor.
File Access: The "Import" button provides an alternative method for
acquiring the speech waveform. Clicking on this button causes AbbotAudio
to read ASCII, linearly encoded, 16 kHz data from the file "test_data"
(in the current directory). Similarly, clicking on "Export" causes
AbbotAudio to write ASCII, linearly encoded, 16 kHz data to the file
"timeData".
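The exact layout expected in "test_data" is not spelled out above.
Assuming it is one decimal sample value per line (an assumption, not
checked against AbbotAudio), a headerless 16-bit file in native byte
order could be converted with od, as in this sketch. The function name
raw_to_ascii is hypothetical.

```shell
#!/bin/sh
# raw_to_ascii RAWFILE
# Print the 16-bit samples of a headerless file as decimal text.
# ASSUMPTION: one sample per line is the layout wanted for "test_data".
raw_to_ascii() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    # od -An drops the offset column, -v prints repeated data in full,
    # -t d2 decodes signed 2-byte integers in the machine's byte order.
    # awk then puts each value on a line of its own.
    od -An -v -t d2 "$1" |
        awk '{ for (i = 1; i <= NF; i++) print $i }'
}
```

It might be used as: raw_to_ascii etc/test.raw > test_data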
There exists another flag called "-showguts". When AbbotDemo is invoked
with this flag set, another window is created that shows the phonemes
that were recognised in the sentence. Like the spectrogram option in
AbbotAudio, time is displayed on the horizontal axis. The vertical axis
has one line for every phoneme in the system; the width of the line
indicates the estimate of the probability that the given phoneme was
present.
Alternatively, if you do not have X or have problems associated with
AbbotAudio, you can send prerecorded files through the recogniser by
specifying the names of the audio files on the command line. These files
should be of speech sampled at 16 kHz with 16 bits/sample in the natural
byte order and with no header. For example:
unix$ srec -t 3 -s16000 -b16 test.raw
Speed 16000 Hz (mono)
16 bits per sample
unix$ ~/AbbotDemo-0.4/AbbotDemo test.raw
0
75 A
100 BEST
125 BEST AND LOAN
150 PRESIDENT CLINTON AND IN
175 PRESIDENT CLINTON DENIED IT
1 PRESIDENT CLINTON DENIED IT
0
The file test.raw is included as an example in the 'etc' directory.
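A quick sanity check on a headerless file is to compare its size with
the stated format: 16000 samples/s at 2 bytes per sample gives 32000
bytes per second of speech. The sketch below (the function name
raw_seconds is made up for this example) reports the implied duration
and rejects a file with an odd byte count. It cannot tell whether a
header is present or whether the sampling rate is right; it only checks
the arithmetic.

```shell
#!/bin/sh
# raw_seconds RAWFILE
# Print the duration in seconds implied by the file size, assuming
# 16 kHz, 16 bits/sample, mono, no header.
raw_seconds() {
    bytes=$(wc -c < "$1") || return 1
    # 16-bit samples occupy two bytes each, so an odd size cannot be a
    # whole number of samples.
    if [ $((bytes % 2)) -ne 0 ]; then
        echo "odd byte count: $bytes" >&2
        return 1
    fi
    # 16000 samples/s * 2 bytes/sample = 32000 bytes per second.
    awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / 32000 }'
}
```

A result far from the length of the recording suggests the file is not
in the expected format.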
Q5: Selecting the input device
------------------------------
The input device can be selected with a command line option to
AbbotAudio or using an environment variable.
command line (checked first)
    -input  <input-choice>      set input port
    -output <output-choice>     set output port
environment variable (checked if not specified on command line)
    setenv ABBOTAUDIO_INPUT  <input-choice>     set input port
    setenv ABBOTAUDIO_OUTPUT <output-choice>    set output port
Where the <input-choice> is one of:
                                                  Default
    SUN  : mic, line                              mic
    SGI  : mic, line, digital                     mic
    HP   : mic, line                              line
    LINUX: NONE                                   -
and the <output-choice> is one of:
                                                  Default
    SUN  : speaker, headphone, line               speaker
    SGI  : NONE                                   -
    HP   : speaker, headphone, line-out, jack     jack
    LINUX: NONE                                   -
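The lookup order described above (command line first, then environment
variable, then the platform default) can be pictured with the following
sketch. It illustrates the precedence only and is not AbbotAudio's own
code; the function name resolve_input is hypothetical.

```shell
#!/bin/sh
# resolve_input DEFAULT [ARGS...]
# Print the input port that would be chosen: an "-input <choice>" pair
# in ARGS wins, otherwise $ABBOTAUDIO_INPUT, otherwise DEFAULT.
resolve_input() {
    default=$1
    shift
    # Scan the remaining arguments for "-input" followed by a value.
    while [ $# -gt 0 ]; do
        if [ "$1" = "-input" ] && [ $# -ge 2 ]; then
            echo "$2"
            return 0
        fi
        shift
    done
    # No command line option: use the environment variable if it is
    # set and non-empty, else the platform default.
    echo "${ABBOTAUDIO_INPUT:-$default}"
}
```

Note that setenv is csh syntax; under sh or bash the equivalent is, for
example, export ABBOTAUDIO_INPUT=line.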
Q6: Troubleshooting
-------------------
If no output:
* Did AbbotDemo produce any warning messages?
* Did a waveform appear after recording?
* Check the operation of the rest of the system with: AbbotDemo etc/test.raw
No waveform may indicate a number of trouble spots. Consider the following:
* microphone connected to inappropriate jack
* line levels are set incorrectly
* recording levels are set incorrectly
* noise level of audio front-end has unexpected characteristics
which cause problems for the speech detector. If you suspect this
to be the case, click on "Calibrate" in AbbotAudio and collect some
silence.
If poor output:
* Was the signal recorded in noise-free conditions?
* Are you putting on your best British accent?
* Are there very many out of vocabulary words?
* Is the text similar to that of a business newspaper?
Q7: Known bugs
--------------
This is the list of bugs that we know exist. We will work on these when
we get the time/funds to do so.
* AbbotAudio and x_show_guts have display problems if they are partially
overlaid with another window
* x_show_guts has a display problem whereby the phone 'blobs' are a
little wider than they should be so they can overlap
* There is a mismatch between the pronunciations used for training
the American English system and those provided in this package
Q8: Is this package supported?
------------------------------
No (but see Q2b).
If you know how to submit bug reports, then please do so.
QN-2: Legalities
----------------
The user is granted a royalty free licence to use this software as is.
No changes may be made to this software or any of the associated data
files. The complete package may be redistributed provided that no
charge is made other than reasonable distribution costs. The software
may not be incorporated into any other software without prior
permission.
QN-1: Who is responsible for AbbotDemo?
---------------------------------------
Tony Robinson (Cambridge University)
Mike Hochberg (Cambridge University)
Steve Renals (Sheffield University)
Dan Kershaw (Cambridge University)
Beth Logan (Cambridge University)
Carl Seymour (Cambridge University)
Much of the funding for the recent development of this system was
provided by the ESPRIT Wernicke Project with partners:
CUED Cambridge University Engineering Department, UK
ICSI International Computer Science Institute, USA
INESC Instituto de Engenharia de Sistemas e Computadores, Portugal
LHS Lernout Hauspie SpeechSystems, Belgium
and associates:
SU Sheffield University, UK
FPMs Faculte Polytechnique de Mons, Belgium
Dedicated hardware for training the recurrent networks and system software
for that hardware were provided by ICSI.
The Perceptual Linear Prediction code was researched and implemented by
Hynek Hermansky (Oregon Graduate Institute).
The acoustic and language models for AbbotDemo were derived from materials
distributed by the Linguistic Data Consortium.
ftp://ftp.cis.upenn.edu/pub/ldc
The CMU statistical language modelling toolkit was used to generate the
trigram language model.
The BEEP dictionary was used for British English pronunciations.
ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/beep-0.6.tar.gz
The CMU dictionary was used for American English pronunciations.
ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.3.Z
The CMU phone set was expanded using code provided by ICSI.
QN: Where can I find out more?
------------------------------
Specific publications on this system include:
Tony Robinson
"The Application of Recurrent Nets to Phone Probability
Estimation", IEEE Transactions on Neural Networks, Volume 5,
Number 2, March 1994.
M M Hochberg, A J Robinson and S J Renals
"ABBOT: The CUED Hybrid Connectionist-HMM WSJ Speech Recognition
System", Proc. of ARPA SLS Workshop, Morgan Kaufmann, March 1994
Mike Hochberg, Tony Robinson and Steve Renals
"Large Vocabulary Continuous Speech Recognition using a Hybrid
Connectionist HMM System", International Conference on Spoken
Language Processing, pages 1499-1502, 1994.
M M Hochberg, G D Cook, S J Renals, A J Robinson and R T Schechtman,
"The 1994 Abbot Hybrid Connectionist-HMM Large-Vocabulary
Recognition System", ARPA Spoken Language Systems, Morgan Kaufmann,
1995.
Tony Robinson, Mike Hochberg and Steve Renals,
"The use of recurrent networks in continuous speech recognition",
chapter 19, Automatic Speech and Speaker Recognition - Advanced
Topics, edited by C H Lee, K K Paliwal and F K Soong, Kluwer
Academic Publishers, 1995 (hopefully).
A good tutorial on speech recognition and hybrid connectionist/HMM
techniques is:
Nelson Morgan and Herve Bourlard,
"Continuous Speech Recognition", IEEE Signal Processing magazine,
volume 12, number 3, pages 24-42, May 1995
The definitive book on this subject is:
Herve Bourlard and Nelson Morgan,
"Continuous Speech Recognition: A Hybrid Approach", Kluwer
Academic Publishers, 1993
More general information on speech recognition and pointers to tutorial
articles and books can be found in the comp.speech FAQ http://
and http://svr-www.eng.cam.ac.uk/comp.speech.
Please direct all queries to AbbotDemo@compute.demon.co.uk